Improving the Knowledge Gradient Algorithm

Neural Information Processing Systems

The knowledge gradient (KG) algorithm is a popular policy for the best arm identification (BAI) problem. It is built on the simple idea of always choosing the measurement that yields the greatest expected one-step improvement in the estimate of the best mean of the arms.
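
To make the one-step idea concrete, here is a minimal Monte Carlo sketch of the KG quantity for independent Gaussian arms with known observation noise; the function name, parameter values, and sampling-based estimate are illustrative assumptions, not the closed-form KG factor or the refinements studied in this paper.

```python
import numpy as np

def kg_factor_mc(mu, sigma, noise_sd, n_samples=10_000, rng=None):
    """Monte Carlo estimate of the one-step knowledge-gradient value of measuring each arm.

    mu, sigma : current posterior means and standard deviations of the arm means
    noise_sd  : known observation-noise standard deviations
    Returns, per arm, the expected improvement in the estimated best mean
    from taking one more measurement of that arm.
    """
    rng = np.random.default_rng(rng)
    mu, sigma, noise_sd = map(np.asarray, (mu, sigma, noise_sd))
    best_now = mu.max()
    kg = np.zeros(mu.size)
    for a in range(mu.size):
        # Posterior-predictive draws of the next observation of arm a.
        y = rng.normal(mu[a], np.sqrt(sigma[a] ** 2 + noise_sd[a] ** 2), n_samples)
        # Conjugate Gaussian update of arm a's posterior mean given that observation.
        post_var = 1.0 / (1.0 / sigma[a] ** 2 + 1.0 / noise_sd[a] ** 2)
        post_mean = post_var * (mu[a] / sigma[a] ** 2 + y / noise_sd[a] ** 2)
        best_other = np.delete(mu, a).max()
        kg[a] = np.maximum(post_mean, best_other).mean() - best_now
    return kg

# The KG policy measures the arm with the largest expected one-step improvement.
mu = np.array([0.2, 0.5, 0.45])
sigma = np.array([1.0, 0.3, 0.8])
print(np.argmax(kg_factor_mc(mu, sigma, noise_sd=np.ones(3))))
```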


Appendix for " Fine-Grained Theoretical Analysis of Federated Zeroth-Order Optimization "

Neural Information Processing Systems

The main notations of this paper are summarized in Table 1 (Descriptions of the main notations used in this work); for example, N and n denote the total number of clients and the number of samples held by each client, respectively, and additional notations such as S are defined there as well. We first introduce the lemmas that will be used in our proofs. Let e denote the base of the natural logarithm. The result stated in Part (b) is then proved, and the optimization bound is given.
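
For context, below is a minimal sketch of the standard two-point zeroth-order gradient estimator and a FedAvg-style round built on it, the kind of scheme such analyses typically study; the estimator form, step sizes, and the toy client objectives are assumptions for illustration, not the paper's exact algorithm.

```python
import numpy as np

def zo_grad(f, x, mu=1e-3, rng=None):
    """Two-point zeroth-order gradient estimate of f at x along a random unit direction."""
    rng = np.random.default_rng(rng)
    u = rng.normal(size=x.shape)
    u /= np.linalg.norm(u)
    return x.size * (f(x + mu * u) - f(x - mu * u)) / (2 * mu) * u

def fedzo_round(client_objs, x, local_steps=5, lr=0.05):
    """One FedAvg-style round: each client runs a few local ZO-SGD steps, the server averages."""
    local_iterates = []
    for f in client_objs:
        x_i = x.copy()
        for _ in range(local_steps):
            x_i = x_i - lr * zo_grad(f, x_i)
        local_iterates.append(x_i)
    return np.mean(local_iterates, axis=0)

# Toy setup: N = 4 clients, each holding a shifted quadratic objective.
clients = [lambda z, c=c: float(np.sum((z - c) ** 2)) for c in np.linspace(-1.0, 1.0, 4)]
x = np.zeros(8)
for _ in range(100):
    x = fedzo_round(clients, x)
print(np.mean([f(x) for f in clients]))  # average objective after training
```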




Convergence Dynamics of Over-Parameterized Score Matching for a Single Gaussian

Zhang, Yiran, Xu, Weihang, Zhou, Mo, Fazel, Maryam, Du, Simon Shaolei

arXiv.org Artificial Intelligence

Score matching has become a central training objective in modern generative modeling, particularly in diffusion models, where it is used to learn high-dimensional data distributions through the estimation of score functions. Despite its empirical success, the theoretical understanding of the optimization behavior of score matching, particularly in over-parameterized regimes, remains limited. In this work, we study gradient descent for training over-parameterized models to learn a single Gaussian distribution. Specifically, we use a student model with $n$ learnable parameters and train it on data generated from a single ground-truth Gaussian using the population score matching objective. We analyze the optimization dynamics under multiple regimes. When the noise scale is sufficiently large, we prove a global convergence result for gradient descent. In the low-noise regime, we identify the existence of a stationary point, highlighting the difficulty of proving global convergence in this case. Nevertheless, we show convergence under certain initialization conditions: when the parameters are initialized to be exponentially small, gradient descent ensures convergence of all parameters to the ground truth. We further prove that without the exponentially small initialization, the parameters may not converge to the ground truth. Finally, we consider the case where parameters are randomly initialized from a Gaussian distribution far from the ground truth. We prove that, with high probability, only one parameter converges while the others diverge, yet the loss still converges to zero with a $1/\tau$ rate, where $\tau$ is the number of iterations. We also establish a nearly matching lower bound on the convergence rate in this regime. This is the first work to establish global convergence guarantees for Gaussian mixtures with at least three components under the score matching framework.
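
A minimal 1-D sketch of the setting described: an over-parameterized student (a uniform mixture with n learnable means and a fixed scale) fitted by gradient descent to the known score of a single ground-truth Gaussian under a Monte Carlo version of the population score matching objective. The finite-difference gradients, step size, and dimensions are illustrative assumptions, not the paper's parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_star, sigma = 1.0, 2.0        # ground-truth Gaussian N(mu_star, sigma^2)
n = 4                            # number of learnable student means (over-parameterized)
w = rng.normal(0.0, 0.1, size=n) # student means, initialized near zero

# Fixed large sample standing in for the population score matching objective.
x = rng.normal(mu_star, sigma, size=5000)
true_score = (mu_star - x) / sigma**2

def student_score(w, x):
    """Score of the uniform mixture (1/n) * sum_i N(w_i, sigma^2) at the points x."""
    z = (x[:, None] - w[None, :]) / sigma
    logp = -0.5 * z**2
    a = np.exp(logp - logp.max(axis=1, keepdims=True))
    a /= a.sum(axis=1, keepdims=True)                      # mixture responsibilities
    return (a * (w[None, :] - x[:, None])).sum(axis=1) / sigma**2

def loss(w):
    # Equals zero when every student mean coincides with mu_star.
    return np.mean((student_score(w, x) - true_score) ** 2)

lr, eps = 1.0, 1e-5   # step size chosen for this toy setting
for _ in range(2000):
    # Central finite-difference gradient of the loss with respect to the means.
    g = np.array([(loss(w + eps * np.eye(n)[i]) - loss(w - eps * np.eye(n)[i])) / (2 * eps)
                  for i in range(n)])
    w -= lr * g

print(w, loss(w))  # learned means and final score matching loss
```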


Translation-equivariant Representation in Recurrent Networks with a Continuous Manifold of Attractors: Supplementary Information

Wen-Hao Zhang

Neural Information Processing Systems

Based on the requirement of equivariant representation (Eq. ...), and since the translation is continuous, the amount of translation can be made infinitesimally small. In Eq. (S3) we also define $\hat{p}$. Differentiating the above equation, we derive a differential form of the translation operator, $d\hat{T}(a)/da = \hat{p}\exp(a\hat{p}) = \hat{p}\hat{T}(a)$ (S7). If the Gaussian ansatz is correct, then based on Eq. (8a) the solution should satisfy $u(x - s) = \rho\, W \ldots$. We performed a perturbative analysis to study the stability of the CAN dynamics. Substituting the above equation into the modified CAN dynamics (Eq. S15e) and projecting, Eq. (S16) simplifies to $\tau\,\partial_t u(x - s) + \tau \ldots$. Similar to the analysis of the CAN, we propose a Gaussian ansatz for the network's ... For simplicity, we assume the speed neurons' responses ... The projection computes the inner product between the network dynamics (Eq. ...) and ...
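
For reference, a one-line derivation of the differential form in Eq. (S7), assuming the translation operator is generated by $\hat{p}$ as $\hat{T}(a) = \exp(a\hat{p})$, which is the form the surrounding text suggests:

```latex
% Assuming the translation operator is generated by \hat{p}: \hat{T}(a) = \exp(a \hat{p}).
\frac{d\hat{T}(a)}{da}
  = \frac{d}{da}\exp\!\left(a\hat{p}\right)
  = \hat{p}\,\exp\!\left(a\hat{p}\right)
  = \hat{p}\,\hat{T}(a) \qquad \text{(cf. Eq. S7)}
```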



Non-asymptotic Convergence of Training Transformers for Next-token Prediction

Neural Information Processing Systems

The theoretical understanding of next-token prediction (NTP) is limited, with existing studies focusing mainly on asymptotic performance. This paper provides a fine-grained non-asymptotic analysis of the training dynamics of a one-layer transformer consisting of a self-attention module followed by a feed-forward layer.
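
A minimal numpy sketch of the model class described here: a one-layer transformer with single-head causal self-attention followed by a feed-forward layer, evaluated with the next-token cross-entropy loss. The ReLU nonlinearity, the absence of layer normalization, and all dimensions are illustrative assumptions rather than the paper's exact parameterization.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def one_layer_transformer_ntp_loss(tokens, E, WQ, WK, WV, W1, W2, U):
    """Next-token-prediction loss of a one-layer transformer:
    token embedding -> single-head causal self-attention -> feed-forward -> logits.
    """
    T = len(tokens)
    X = E[tokens]                                       # (T, d) token embeddings
    Q, K, V = X @ WQ, X @ WK, X @ WV
    scores = Q @ K.T / np.sqrt(K.shape[1])
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)    # causal mask: no attending to future tokens
    scores[mask] = -np.inf
    H = softmax(scores) @ V                             # attention output
    F = np.maximum(H @ W1, 0.0) @ W2                    # ReLU feed-forward layer
    logits = F @ U                                      # (T, vocab) next-token logits
    # Cross-entropy of predicting token t+1 from position t.
    probs = softmax(logits[:-1])
    return -np.mean(np.log(probs[np.arange(T - 1), tokens[1:]] + 1e-12))

# Tiny random example (hypothetical sizes, for illustration only).
rng = np.random.default_rng(0)
vocab, d, d_ff, T = 16, 8, 32, 10
params = [rng.normal(0, 0.1, s) for s in
          [(vocab, d), (d, d), (d, d), (d, d), (d, d_ff), (d_ff, d), (d, vocab)]]
tokens = rng.integers(0, vocab, size=T)
print(one_layer_transformer_ntp_loss(tokens, *params))
```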